Add a diagnostic kstat for obtaining pool status #16026
base: master
This kstat output does not require taking the spa_namespace lock, unlike 'zpool status'. It can be used for investigations when pools are in a hung state while holding global locks required for a traditional 'zpool status' to proceed. This kstat is not safe to use while a pool's configuration is changing (i.e., adding/removing devices). Therefore, this kstat is not intended to be a general replacement or alternative to using 'zpool status'.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
Co-authored-by: Don Brady <don.brady@klarasystems.com>
Signed-off-by: Don Brady <don.brady@klarasystems.com>
Overall, I think JSON output is a good thing. It's something that's been on the ZFS wishlist since the dawn of time, and there have been some aborted attempts over the years to implement it. Just some initial thoughts before I look at the code:
One major advantage that I see to this being a lockless kstat file is the ability for it to be parsed and used by metrics exporters like Frequently, these tools are deployed in containers, where
Just thinking out loud - we could potentially combine the two, and have
JFYI, if anybody is interested in carrying this work forward, it might be worth taking a look at #16484.
Motivation and Context
A hung pool process can be left holding the spa config lock or the spa namespace lock. If an admin wants to observe the status of a pool using the traditional zpool status, it could hang waiting for one of the locks held by the stuck process. It would be nice to observe pool status in this scenario without the risk of the inquiry hanging.
Description
Exploring Solutions
Infer that the lock is stuck (held for an extended period) and conclude that locking is not required to read the pool stats. This is somewhat of a variant of option 1, in which the source code, rather than the admin user, determines that it is safe to ignore locking because the pool configuration cannot be changing.
Refactor the spa code to use finer-grained locking, and perhaps reader/writer locks in lieu of mutex locks, to alleviate the obvious points of lock contention when a pool gets stuck. Don't hold these global-scope locks across disk I/O, etc.
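The "infer that the lock is stuck" option can be illustrated with a small user-space analogy. This is only a sketch in Python, not ZFS kernel code: `read_status`, `read_fn`, and `timeout_s` are hypothetical names, and the timeout stands in for "held for an extended period".

```python
import threading


def read_status(lock, read_fn, timeout_s=1.0):
    """Read a status snapshot, falling back to an unlocked read.

    Returns (snapshot, locked). `locked` is False when the lock could not
    be taken within `timeout_s`, so the caller knows the snapshot was read
    without protection and may be inconsistent.
    """
    # Wait a bounded time for the lock instead of blocking forever.
    acquired = lock.acquire(timeout=timeout_s)
    try:
        # If the lock is presumed stuck, read anyway rather than hang.
        return read_fn(), acquired
    finally:
        # Only release what this call actually acquired.
        if acquired:
            lock.release()
```

Returning the `locked` flag alongside the data keeps the trade-off visible: an unlocked snapshot is useful for diagnosing a hang, but it should not be trusted the way a locked read would be.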
This change is implementing option 1a: adding a kstat at zfs/<pool>/stats.json which ignores any locking. This kstat can be used for investigations when pools are in a hung state while holding global locks required for a traditional 'zpool status' to proceed.
NOTE: This kstat is not safe to use in conditions where pools are in the process of configuration changes (i.e., adding/removing devices). Therefore, this kstat is not intended to be a general replacement or alternative to using 'zpool status'.
Sponsored-by: Wasabi Technology, Inc.
Sponsored-By: Klara Inc.
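A consumer of this kstat might look roughly like the following sketch. It assumes the Linux kstat root /proc/spl/kstat/zfs and makes no assumption about the keys inside the JSON; `read_pool_stats` and `KSTAT_ROOT` are hypothetical names introduced here for illustration. Because the read is lockless, the sketch defensively treats unreadable or unparsable output as "no data" rather than raising.

```python
import json
from pathlib import Path

# Assumed location of ZFS kstats on Linux; adjust for other platforms.
KSTAT_ROOT = Path("/proc/spl/kstat/zfs")


def read_pool_stats(pool, root=KSTAT_ROOT):
    """Return the parsed stats.json kstat for `pool`, or None if unavailable."""
    path = Path(root) / pool / "stats.json"
    try:
        text = path.read_text()
    except OSError:
        # Pool not imported, kstat not present, or no permission to read it.
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Treat incomplete or malformed output as missing data.
        return None
```

For example, `read_pool_stats("tank")` (with "tank" standing in for a real pool name) would return a Python object on success and `None` otherwise, which makes it easy to retry from a monitoring loop.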
How Has This Been Tested?
zpool_status_kstat_pos test to validate the JSON output
sample kstat output (degraded mirror):